
DIG-2092: optimizing variant search#374

Merged
daisieh merged 7 commits into develop from daisieh/optimize-search
Oct 29, 2025

@daisieh daisieh commented Oct 28, 2025

I made the following optimizations:

  • I changed the primary key of the pos_bucket table to a composite key on the contig and the pos_bucket_id, where pos_bucket_id is the start position of the bucket on that chromosome. The migration file pr_374.sql updates existing tables to match.
  • I look up the refseq for the chromosome only once, instead of repeating the lookup over and over.
  • I skip the JOIN on the header table when the search terms don't include a header.
  • I fetch the vcf file for looking up samples and experiments only once per search, saving the results in experiment_dict.
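As a rough illustration of the composite-key change, here is a minimal sketch using SQLite from the Python standard library. It is not the project's schema or migration: only pos_bucket and pos_bucket_id come from the description above, while the contig_id column name, the 10,000-base bucket width, and the helper functions are assumptions made for the example.

```python
import sqlite3

BUCKET_SIZE = 10_000  # assumed bucket width, for illustration only


def make_db():
    """Create an in-memory table keyed on (contig, bucket start)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE pos_bucket (
            contig_id     TEXT    NOT NULL,  -- hypothetical column name
            pos_bucket_id INTEGER NOT NULL,  -- start position of the bucket
            PRIMARY KEY (contig_id, pos_bucket_id)
        )
        """
    )
    return conn


def add_bucket(conn, contig, pos):
    """Insert the bucket containing pos; return the bucket start."""
    start = (pos // BUCKET_SIZE) * BUCKET_SIZE
    # The same start position may exist on several contigs, but only
    # once per contig, so a repeated insert is silently ignored.
    conn.execute(
        "INSERT OR IGNORE INTO pos_bucket VALUES (?, ?)", (contig, start)
    )
    return start


def buckets_in_range(conn, contig, start, end):
    """Return bucket starts on one contig that overlap [start, end]."""
    first = (start // BUCKET_SIZE) * BUCKET_SIZE
    rows = conn.execute(
        "SELECT pos_bucket_id FROM pos_bucket "
        "WHERE contig_id = ? AND pos_bucket_id BETWEEN ? AND ? "
        "ORDER BY pos_bucket_id",
        (contig, first, end),
    )
    return [row[0] for row in rows]
```

With a key of this shape, a range lookup on a single contig can be answered from the primary-key index alone. Whether the production database benefits in the same way depends on its engine and actual schema.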

To benchmark, start on the develop branch and set up a console to follow the log, something like:

tail -f tmp/logs/buffer.*.log | grep candigv2_htsget_1

Then run the test search:

curl "http://candig.docker.internal:5080/genomics/beacon/v2/g_variants?assemblyId=hg38&start=9560000&end=9760000&referenceName=21&fullSearch=true" \
     -H 'Authorization: Bearer <site admin token>' \
     -H 'Content-Type: application/json; charset=utf-8'

and see how long it takes to run that full search. You're looking for a line like

2025-10-28T04:28:00+00:00	candigv2_htsget_1	{"source":"stdout","log":"level: DEBUG, file: watchdog.observers.inotify_buffer, log: in-event <InotifyEvent: src_path=b'/home/candig/tmp/search/to_search/7a4ce858-b3b6-11f0-a253-0242ac120009', wd=1, mask=IN_DELETE, cookie=0, name='7a4ce858-b3b6-11f0-a253-0242ac120009'>"}

to show that the full search is complete. On my machine, the timestamps are 4 seconds apart.
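If reading timestamps by eye is error-prone, the difference can be computed with a small helper. This is only a sketch: it assumes each log line starts with an ISO 8601 timestamp followed by whitespace, as in the line shown above, and the function names are made up for the example. Which log line you treat as the start of the search is up to you; it is not specified here.

```python
from datetime import datetime


def log_timestamp(line):
    """Parse the leading ISO 8601 timestamp of a log line."""
    # The timestamp is the first whitespace-separated field.
    return datetime.fromisoformat(line.split(None, 1)[0])


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    delta = log_timestamp(end_line) - log_timestamp(start_line)
    return delta.total_seconds()
```

Pass the line you consider the start of the search and the IN_DELETE line that marks its completion. Note that the timestamps in the sample line have one-second resolution, so differences under a second will show up as 0.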

Then set up this branch. Run make recompose-htsget and make test-integration to get test data into the system.
Rerun the test search and check the timestamps again. On my machine, both timestamps are now within the same second.

Make sure both runs give you the same result, which should look like the following:

{
  "beaconResultUrl": "http://candig.docker.internal:5080/genomics/beacon/v2/result/7ed4f58c-b3a2-11f0-aa57-0242ac120009",
  "estimatedResults": {
    "local-SYNTH_01": [
      {
        "submitter_sample_id": "local-SAMPLE_ALL_0002",
        "variant_count": 185
      }
    ]
  },
  "estimatedSearchParameters": {
    "end": 9769999,
    "referenceName": "21",
    "start": 9560000
  },
  "meta": {
    "apiVersion": "1.0.0",
    "beaconId": "org.candig.htsget.beacon",
    "receivedRequestSummary": {
      "apiVersion": "1.0.0",
      "pagination": {
        "limit": 10,
        "skip": 0
      },
      "requestParameters": {
        "assembly_id": "hg38",
        "end": 9760000,
        "full_search": true,
        "reference_genome": "hg38",
        "reference_name": "21",
        "start": 9560000
      },
      "requestedGranularity": "record",
      "requestedSchemas": [
        {
          "entityType": "genomicVariant",
          "schema": "ga4gh-beacon-variant-v2.0.0"
        }
      ]
    }
  }
}

and if you follow the beaconResultUrl, you should get a very long result that has

  "responseSummary": {
    "exists": true,
    "numTotalResults": 295
  }

at the end.
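To compare the two runs without diffing by hand, something like the following sketch could help. It assumes you saved each response as JSON text and that beaconResultUrl is the only field expected to differ between runs (it appears to embed a per-search identifier); the helper names are hypothetical.

```python
import json

# Keys assumed to change from one search to the next.
VOLATILE_KEYS = {"beaconResultUrl"}


def strip_volatile(obj):
    """Recursively drop keys that are expected to differ between runs."""
    if isinstance(obj, dict):
        return {
            key: strip_volatile(value)
            for key, value in obj.items()
            if key not in VOLATILE_KEYS
        }
    if isinstance(obj, list):
        return [strip_volatile(item) for item in obj]
    return obj


def same_result(before_json, after_json):
    """True if two JSON responses match once volatile keys are removed."""
    before = strip_volatile(json.loads(before_json))
    after = strip_volatile(json.loads(after_json))
    return before == after
```

This comparison is order-sensitive for lists, so if the service returns samples in a different order between runs, sort them before comparing.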

@daisieh daisieh force-pushed the daisieh/optimize-search branch from 6e0e179 to 4c2ea55 Compare October 28, 2025 02:11
@daisieh daisieh force-pushed the daisieh/optimize-search branch from 4c2ea55 to c7f7f20 Compare October 28, 2025 04:14
@daisieh daisieh requested a review from SonQBChau October 28, 2025 04:30

@SonQBChau SonQBChau left a comment

I tested it on my machine. The old version took about 5 seconds, while the new one finished almost instantly. I think the log timestamps might not accurately reflect the actual db query runtime, but the change from a regular key to a composite key makes sense and should speed up join queries. The other changes also look good to me!

@daisieh daisieh merged commit 5c348ee into develop Oct 29, 2025
1 check passed
@daisieh daisieh deleted the daisieh/optimize-search branch December 19, 2025 21:52